Abstract

For the purposes of manipulation, we would like to know what parts
of the environment are physically coherent ensembles - that is,
which parts will move together, and which are more or less
independent. It takes a great deal of experience before this
judgement can be made from purely visual information. This paper
develops active strategies for acquiring that experience through
experimental manipulation, using tight correlations between arm
motion and optic flow to detect both the arm itself and the
boundaries of objects with which it comes into contact. We argue
that following causal chains of events out from the robot’s body
into the environment allows for a very natural developmental
progression of visual competence, and relate this idea to results
in neuroscience.

1.  Introduction 

A robot is an actor in its environment and not simply a passive
observer. This gives it the potential to examine the world using
causality, by performing probing actions and learning from the
response. Tracing chains of causality from motor action to
perception (and back again) is important both to understand how the
brain deals with sensorimotor coordination and to implement those
same functions in an artificial system, such as a humanoid robot.

In  this  paper,  we  propose  that  such  causal  probing 
can  be  arranged  in  a  developmental  sequence  leading 
to  a  manipulation-driven  representation  of  objects. 
We  present  results  for  two  important  steps  along  the 
way,  and  describe  how  we  plan  to  proceed. 

Table 1 shows three levels of causal complexity. The simplest
causal chain that the robot experiences is the perception of its
own actions. The temporal aspect is immediate: visual information
is tightly synchronized to motor commands. We use this strong
correlation to identify parts of the robot body - specifically, the
end-point of the arm.

Once this causal connection is established, we can go further and
use it to actively explore the boundaries of objects. In this case,
there is one more step in the causal chain, and the response may be
delayed, since initiating a reaching movement doesn’t immediately
elicit consequences in the environment.

Finally  we  argue  that  extending  this  causal  chain 
further  will  allow  us  to  approach  the  representational 
power  of  “mirror  neurons”  (Fadiga  et  al.,  2000), 
where  a  connection  is  made  between  our  own  actions 
and  the  actions  of  another. 

2.  The  elusive  object 

Sensory information is intrinsically ambiguous, and very distant
from the world of well-defined objects in which humans believe they
live. What criterion should be applied to distinguish one object
from another? How can perception support such a phenomenon as
figure-ground segmentation? Consider the example in Figure 1. It is
immediately clear that the drawing on the left is a cross, perhaps
because we already have a criterion that allows segmenting on the
basis of the intensity difference. It is slightly less clear that
the zeros and ones in the middle panel are still a cross. What can
we say about the array on the right? If we are not told, and we do
not have the criterion to perform the figure-ground segmentation,
we might think this is just a random collection of numbers. But if
we are told that the criterion is “prime numbers vs. non-prime”
then a cross can still be identified.

While we have to be inventive to come up with a segmentation
problem that tests a human, we don’t have to go far at all to find
something that baffles our robots. Figure 2 shows a robot’s-eye
view of a cube sitting on a table. Simple enough, but many rules of
thumb used in segmentation fail in this particular case. And even
an experienced human observer, diagnosing the cube as a separate
object based on its shadow and subtle differences in the surface
texture of the cube and table, could in fact be mistaken - perhaps
some malicious researcher is up to mischief. The only way to find
out for sure is to take action, and start poking and prodding. As
early as 1734, Berkeley observed that:

...objects can only be known by touch. Vision is subject to
illusions, which arise from the distance-size problem...
(Berkeley, 1972)

In this paper, we provide support for a more nuanced proposition:
that in the presence of touch, vision becomes more powerful, and
many of its illusions fade away.

type                      | nature of causation          | time profile
sensorimotor coordination | direct causal chain          | strict synchrony
object probing            | one level of indirection     | fast onset upon contact,
                          |                              | potential for delayed effects
mirror representation     | complex causation involving  | arbitrarily delayed onset
                          | multiple causal chains       | and effects

Table 1: Degrees of causal indirection. There is a natural trend
from simpler to more complicated tasks. The more time-delayed an
effect, the more difficult it is to model.

[Figure 1 panels, left to right: a cross; a binary cross; a cross
of prime numbers]

Figure 1: Three examples of crosses, following (Manzotti and
Tagliasco, 2001). The human ability to segment objects is not
general-purpose, and improves with experience.

Figure 2: A cube on a table. The edges of the table and cube happen
to be aligned (dashed line), the colors of the cube and table are
not well separated, and the cube has a potentially confusing
surface pattern.


Objects  and  actions 

The example of the cross composed of prime numbers is a novel
(albeit unlikely) type of segmentation in our experience as adult
humans. We might imagine that when we were very young, we had to
initially form a set of such criteria to solve the object
identification/segmentation problem in more mundane circumstances.
That such abilities develop and are not completely innate is
suggested by results in neuroscience. For example, Kovacs (Kovacs,
2000) has shown that perceptual grouping is slow to develop and
continues to improve well beyond early childhood (14 years).
Long-range contour integration was tested, and this work elucidated
how this ability develops to enable extended spatial grouping.

Key to understanding how such capabilities could develop is the
well-known result by Ungerleider and Mishkin (Ungerleider and
Mishkin, 1982), who first formulated the hypothesis that objects
are represented differently during action than they are for a
purely perceptual task. Briefly, they argue that the brain’s visual
pathways split into two main streams: the dorsal and the ventral
(Milner and Goodale, 1995). The dorsal deals with the information
required for action, while the ventral is important for more
cognitive tasks such as maintaining an object’s identity and
constancy. Although the dorsal/ventral segregation is emphasized by
many commentators, it is significant that there is a great deal of
cross talk between the streams. Observation of agnosic patients
(Jeannerod, 1997) shows a much more complicated relationship than
the simple dorsal/ventral dichotomy would suggest. For example,
although some patients could not grasp generic objects (e.g.
cylinders), they could correctly preshape the hand to grasp known
objects (e.g. a lipstick): interpreted in terms of the two
pathways, this implies that the ventral representation of the
object can supply the dorsal stream with size information.

The dorsal stream goes through the parietal lobe and premotor
cortex, which project heavily onto the primary motor cortex to
eventually control movements. For many years the premotor cortex
was considered just another big motor area. Recent studies
(Jeannerod, 1997) have demonstrated that this is not the case.
Visually responsive neurons have been found: some are purely
visual, but many have significant visuo-motor characteristics. In
area F5 in the monkey, neurons responding to object manipulation
gestures are found. They can be classified in at least two
different types: canonical and mirror. The canonical type is active
in two situations: i) when grasping an object and ii) when fixating
that same object. For example, a neuron active when grasping a ring
also fires when the monkey simply looks at the ring. This could be
thought of as a neural analogue of the “affordances” of Gibson
(Gibson, 1977).

The second type of neuron, the mirror neuron (Fadiga et al., 2000),
becomes active under two conditions: i) when manipulating an object
(e.g. grasping it), and ii) when watching someone else performing
the same action on the same object. This is a more subtle
representation of objects, which allows and supports, at least in
theory, mimicry behaviors. In humans, area F5 is thought to
correspond to Broca’s area: there is an intriguing link between
gesture understanding, language, imitation, and mirror neurons
(Rizzolatti and Arbib, 1998).

Another important class of neurons in premotor cortex is found in
area F4 (Fogassi et al., 1996). While F5 is more concerned with the
distal muscles (i.e. the hand), F4 controls more proximal muscles
(i.e. reaching). A subset of neurons in F4 has a somatosensory,
visual, and motor receptive field. The visual receptive field (RF)
extends in 3D from a given body part, for example, the forearm. The
somatosensory RF is usually in register with the visual one.
Finally, motor information is integrated into the representation by
keeping the receptive field anchored to the corresponding body part
(the forearm in this example) irrespective of the relative position
of the head and arm.

A  working  hypothesis 

Taken together, these results from neuroscience suggest a very
basic role for motor action. Certainly vision and action are
intertwined at a very basic level. While an experienced adult can
interpret visual scenes perfectly well without acting upon them,
linking action and perception seems crucial to the developmental
process that leads to that competence. We can construct a working
hypothesis: that action is required for object recognition in cases
where an agent has to develop categorization autonomously. Of
course, in standard supervised learning action is not required,
since the trainer does the job of pre-segmenting the data by hand.
In an ecological context, some other mechanism has to be provided.
Ultimately this mechanism is the body itself, which through action
(under some suitable developmental rule) generates informative
percepts.

Neurons in area F4 are thought to provide a body map useful for
generating arm, head, and trunk movements. Our robot autonomously
learns a crude version of this body map by fusing vision and
proprioception. As a step towards establishing the kind of
visuomotor representations seen in F5, we then develop a mechanism
for using reaching actions to visually probe the connectivity and
physical extent of objects without any prior knowledge of the
appearance of the objects (or indeed of the arm itself).


3.  The  experimental  platform 

This work is implemented on the robot Cog, an upper torso humanoid
(Brooks et al., 1999). The robot has previously been applied to
tasks such as visually-guided pointing (Marjanovic et al., 1996),
and rhythmic operations such as turning a crank or driving a slinky
(Williamson, 1998). Cog has two arms, each of which has six degrees
of freedom - two each for the shoulder, elbow, and wrist. The
joints are driven by series elastic actuators (Williamson, 1995) -
essentially a motor connected to its load via a spring (think
strong and torsional rather than loosely coiled). The arm is not
designed to enact trajectories with high fidelity. For that, a very
stiff arm is preferable. Rather, it is designed to perform well
when interacting with a poorly characterized environment, where
collisions are frequent and informative events.


Figure  3:  Degrees  of  freedom  (DOFs)  of  the  robot  Cog. 
The  arms  terminate  either  in  a  primitive  “flipper”  or  a 
four-fingered  hand.  The  head,  torso,  and  arms  together 
contain  22  degrees  of  freedom. 


4.  Perceiving  direct  effects  of  action 

Motion of the arm may generate optic flow directly through the
changing projection of the arm itself, or indirectly through an
object that the arm is in contact with. While the relationship
between the optic flow and the physical motion is likely to be
extremely complex, the correlation in time of the two events will
generally be exceedingly precise. This time-correlation can be used
as a “signature” to identify parts of the scene that are being
influenced by the robot’s motion, even in the presence of other
distracting motion sources. In this section, we show how this tight
correlation can be used to localize the arm in the image without
any prior information about visual appearance. In the next section
we will show that once the arm has been localized we can go
further, and identify the boundaries of objects with which the arm
comes into contact.

Figure 4: An example of the correlation between optic flow and arm
movement. The traces show the movement of the wrist joint (upper
plot) and optic flow sampled on the arm (middle plot) and away from
it (lower plot). As the arm generates a repetitive movement, the
oscillation is clearly visible in the middle plot and absent in the
lower. Before and after the movement the head is free to saccade,
generating the other spikes seen in the optic flow.


Reaching  out 

The first step towards manipulation is to reach objects within the
workspace. If we assume targets are chosen visually, then ideally
we need to also locate the end-effector visually to generate an
error signal for closed-loop control. Some element of open-loop
control is necessary since the end-point may not always be in the
field of view (for example, when it is in its resting position),
and the overall reaching operation can be made faster with a
feed-forward contribution to the control.

The simplest possible open-loop control would map directly from a
fixation point to the arm motor commands needed to reach that point
(Metta et al., 1999) using a stereotyped trajectory, perhaps using
postural primitives (Mussa-Ivaldi and Giszter, 1992). If we can
fixate the end-effector, then it is possible to learn this map by
exploring different combinations of direction of gaze vs. arm
position (Marjanovic et al., 1996, Metta et al., 1999). So locating
the end-effector visually is key both to closed-loop control, and
to training up a feed-forward path. We shall demonstrate that this
localization can be performed without knowledge of the arm’s
appearance, and without assuming that the arm is the only moving
object in the scene.


Figure 5: Detecting the arm/gripper through motion correlation. The
robot’s point of view and the optic flow generated are shown on the
left. On the right are the results of correlation. Large circles
represent the results of applying a region growing procedure to the
optic flow. Here the flow corresponds to the robot’s arm and the
experimenter’s hand in the background. The small circle marks the
point of maximum correlation, identifying the regions that
correspond to the robot’s own arm.


Localizing  the  arm  visually 

The robot is not a passive observer of its arm, but rather the
initiator of its movement. This can be used to distinguish the arm
from parts of the environment that are more weakly affected by the
robot. The arm of a robot was detected in (Marjanovic et al., 1996)
by simply waving it and assuming it was the only moving object in
the scene. We take a similar approach here, but use a more
stringent test of looking for optic flow that is correlated with
the motor commands to the arm. This allows unrelated movement to be
ignored. Even if a capricious engineer were to replace the robot’s
arm with one of a very different appearance, and then stand around
waving the old arm, this detection method would not be fooled.

The actual relationship between arm movements and the optic flow
they generate is complex. Since the robot is in control of the arm,
it can choose to move it in a way that bypasses this complexity. In
particular, if the arm rapidly reverses direction, the optic flow
at that instant will change in sign, giving a tight, clean temporal
correlation. Since our optic flow processing is coarse (a 16 x 16
grid over a 128 x 128 image at 15 Hz), we simply repeat this
reversal a number of times to get a strong correlation signal
during training. With each reversal the probability of correlating
with unrelated motion in the environment goes down. This
probability could also be reduced by higher resolution
(particularly in time) visual processing.

Figure 6: Mapping from proprioceptive input to a visual prediction.
Head and arm joint positions are used to estimate the position of
the projection of the hand in the image plane. Redundant
configurations of the (7 DOF) head are mapped to a simpler (2D)
representation, and the wrist-related DOFs of the arm are ignored.
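
As a rough illustration of why repeating the reversal helps,
suppose (purely as an assumption for this sketch, not a measured
property of the system) that an unrelated moving object happens to
change its flow direction within the detection window of any single
reversal with probability p, independently from one reversal to the
next. The chance that it matches all n reversals is then:

```latex
P_{\text{false}}(n) = p^{\,n},
\qquad \text{e.g. } p = 0.2,\; n = 5
\;\Rightarrow\; P_{\text{false}} = 0.2^{5} = 3.2 \times 10^{-4}.
```

Under the same assumption, finer temporal resolution narrows the
detection window and so lowers p itself, which is the second route
to reducing false matches mentioned above.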

Figure  4  shows  an  example  of  this  procedure  in 
operation,  comparing  the  velocity  of  the  arm’s  wrist 
with  the  optic  flow  at  two  positions  in  the  image 
plane.  A  trace  taken  from  a  position  away  from  the 
arm  shows  no  correlation,  while  conversely  the  flow 
at  a  position  on  the  wrist  is  strongly  different  from 
zero  over  the  same  period  of  time.  Figure  5  shows 
examples  of  detection  of  the  arm  and  rejection  of  a 
distractor. 
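
The correlation test can be sketched in a few lines of Python. This
is a minimal illustration under stated assumptions, not the
implementation used on Cog: it assumes the optic flow has already
been reduced to one signed value per grid cell per frame, and that
a signed arm velocity signal is available at the same frame rate.
The function name localize_arm and its parameters are hypothetical.

```python
import numpy as np

def localize_arm(flow, motor, min_std=1e-6):
    """Find the grid cell whose optic flow best tracks the arm motion.

    flow  : array (T, H, W), signed flow per grid cell per frame (assumed input)
    motor : array (T,), signed arm velocity per frame (assumed input)
    Returns ((row, col), corr) where corr is the (H, W) correlation map.
    """
    flow = np.asarray(flow, dtype=float)
    motor = np.asarray(motor, dtype=float)

    # Remove the mean so that constant drift does not count as correlation.
    f = flow - flow.mean(axis=0)
    m = motor - motor.mean()
    f_std = f.std(axis=0)
    m_std = m.std()

    # Covariance between the motor signal and every cell's flow at once.
    cov = np.tensordot(m, f, axes=(0, 0)) / len(m)

    # Normalized correlation; cells with no motion are left at zero.
    corr = np.zeros_like(cov)
    valid = f_std > min_std
    if m_std > min_std:
        corr[valid] = cov[valid] / (f_std[valid] * m_std)

    # The sign relating flow to joint velocity is unknown, so use magnitude.
    best = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    return (int(best[0]), int(best[1])), corr
```

With a 16 x 16 grid and a motor signal that reverses direction
several times, a cell on the arm should score close to 1 in
magnitude, while a distractor moving on its own schedule scores
much lower even if its flow is larger.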

Localizing  the  arm  using  proprioception 

The localization method for the arm described so far relies on a
relatively long “signature” movement that would slow down reaching.
This can be overcome by training up a function to estimate the
location of the arm in the image plane from proprioceptive
information (joint angles) during an exploratory phase, and using
that to constrain arm localization during actual operation. As a
function approximator we simply fill a look-up table, reducing the
11-dimensional input space of joint angles based on the much lower
number of degrees of freedom used in controlling them (see
Figure 6). Figure 7 shows the resulting behavior after about twenty
minutes of real-time learning.
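
A look-up table of this kind might be sketched as follows. This is
an illustrative guess at the data structure, not the code run on
the robot: the class name, the bin width, and the use of a running
average per cell are all assumptions, and the reduction from raw
joint angles to a low-dimensional state is taken as given.

```python
import math
from collections import defaultdict

class ArmPositionTable:
    """Maps a quantized, reduced joint state to an average image position."""

    def __init__(self, bin_width=0.1):
        # bin_width (in the units of the reduced state) is an assumed parameter.
        self.bin_width = bin_width
        self._sum = defaultdict(lambda: [0.0, 0.0])
        self._count = defaultdict(int)

    def _key(self, reduced_state):
        # Quantize each coordinate so nearby postures share a table cell.
        return tuple(int(math.floor(v / self.bin_width)) for v in reduced_state)

    def update(self, reduced_state, image_xy):
        """Record where the arm was seen for this posture (exploratory phase)."""
        k = self._key(reduced_state)
        self._sum[k][0] += image_xy[0]
        self._sum[k][1] += image_xy[1]
        self._count[k] += 1

    def predict(self, reduced_state):
        """Return the mean observed image position, or None if never visited."""
        k = self._key(reduced_state)
        n = self._count.get(k, 0)
        if n == 0:
            return None
        sx, sy = self._sum[k]
        return (sx / n, sy / n)
```

During exploration, update would be called whenever the correlation
method has localized the arm; during reaching, predict supplies a
starting point for a local visual search, and a None result signals
a posture that has not yet been explored.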

5.  Perceiving  indirect  effects  of  action 

We have assumed that the target of a reaching operation is chosen
visually. As discussed in the introduction, visual segmentation is
not easy, so we should not expect a target selected in this way to
be correctly segmented. For the example scene in Figure 2 (a cube
sitting on a table), the small inner square on the cube’s surface
pattern might be selected as a target. The robot can certainly
reach towards this target, but grasping it would prove difficult
without a correct estimate of the object’s physical extent. In this
section, we develop a procedure for refining the segmentation using
the same idea of correlated motion used earlier to detect the arm.

Figure 7: Predicting the location of the arm in the image as the
head and arm change position. The rectangle represents the
predicted position of the arm using the map learned during a
twenty-minute training run. The predicted position just needs to be
sufficiently accurate to initialize a visual search for the exact
position of the end-effector.

When the arm enters into contact with an object, one of several
outcomes is possible. If the object is large, heavy, or otherwise
unyielding, motion of the arm may simply be resisted without any
visible effect. Such objects can simply be ignored, since the robot
will not be able to manipulate them. But if the object is smaller,
it is likely to move a little in response to the nudge of the arm.
This movement will be temporally correlated with the time of
impact, and will be connected spatially to the end-effector -
constraints that are not available in passive scenarios
(Birchfield, 1999). If the object is reasonably rigid, and the
movement has some component parallel to the image plane, the result
is likely to be a flow field whose extent coincides with the
physical boundaries of the object.

Figure 8 shows how a “poking” movement can be used to refine a
target. During a poke operation, the arm begins by extending
outwards from the resting position. The end-effector (or “flipper”)
is localized as the arm sweeps rapidly outwards, using the
heuristic that it lies at the highest point of the region of optic
flow swept out by the arm in the image (the head orientation and
reaching trajectory are controlled so that this is true). The arm
is driven outward into the neighborhood of the target which we wish
to define, stopping if an unexpected obstruction is reached. If no
obstruction is met, the flipper makes a gentle sweep of the area
around the target. This minimizes the opportunity for the motion of
the arm itself to cause confusion; the motion of the flipper is
bounded around the endpoint whose location we know from tracking
during the extension phase, and can be subtracted easily. Flow not
connected to the end-effector can be ignored as a distractor.

[Figure 8 panels, left to right: Begin; Find end-effector; Sweep;
Contact!; Withdraw]

Figure 8: The upper sequence shows an arm extending into a
workspace, tapping an object, and retracting. This is an
exploratory mechanism for finding the boundaries of objects, and
essentially requires the arm to collide with objects under normal
operation, rather than as an occasional accident. The lower
sequence shows the shape identified from the tap using simple image
differencing and flipper tracking.

For simplicity, the head is kept steady throughout the poking
operation, so that simple image differencing can be used to detect
motion at a higher resolution than optic flow. Because a poking
operation currently always starts from the same location, the arm
is localized using a simple heuristic rather than the procedure
described in the previous section: the first region of optic flow
appearing in the lower part of the robot’s view when the reach
begins is assumed to be the arm.
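
The segmentation step of a poke can be sketched as image
differencing followed by a connectivity test. The fragment below is
a simplified illustration under assumptions of our own choosing,
not the procedure as implemented on the robot: it compares a single
pair of grayscale frames taken around the moment of contact, treats
the flipper as a disc of known centre and radius, and uses
4-connected region growing. The function name, threshold, and
radius are hypothetical.

```python
from collections import deque
import numpy as np

def segment_poked_object(before, after, flipper_rc, flipper_radius=2, threshold=20):
    """Return a boolean mask of moving pixels connected to the end-effector.

    before, after : 2D grayscale frames with the head held still (assumed input)
    flipper_rc    : (row, col) of the tracked end-effector (assumed known)
    """
    # Simple image differencing; cast first to avoid unsigned wrap-around.
    diff = np.abs(after.astype(int) - before.astype(int))
    moving = diff > threshold

    h, w = moving.shape
    rr, cc = np.ogrid[:h, :w]
    near_flipper = ((rr - flipper_rc[0]) ** 2 + (cc - flipper_rc[1]) ** 2
                    <= flipper_radius ** 2)

    # Grow a region outwards from moving pixels at the end-effector.
    connected = np.zeros_like(moving)
    queue = deque()
    for r, c in np.argwhere(moving & near_flipper):
        connected[r, c] = True
        queue.append((int(r), int(c)))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and moving[nr, nc] and not connected[nr, nc]:
                connected[nr, nc] = True
                queue.append((nr, nc))

    # Subtract the flipper's own bounded motion; unconnected motion was never added.
    return connected & ~near_flipper
```

Motion elsewhere in the image never enters the mask because it is
not reachable from the end-effector, which is the sense in which
unconnected flow is discarded as a distractor. Pixels of the object
that fall inside the flipper disc are lost, so the radius trades
off arm rejection against boundary accuracy.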

The poking operation gives clear results for a rigid object that is
free to move. What happens for non-rigid objects and objects that
are attached to other objects? Here the results of poking are
likely to be more complicated to interpret - but in a sense this is
a good sign, since it is in just such cases that the idea of an
object becomes less well-defined. Poking has the potential to offer
an operational theory of “objecthood” that is more tractable than a
vision-only approach might give, and which cleaves better to the
true nature of physical assemblages. The idea of a physical object
is rarely completely coherent, since it depends on where you draw
its boundary and that may well be task-dependent. Poking allows us
to determine the boundary around a mass that moves together when
disturbed, which is exactly what we need to know for manipulation.
As an operational definition of object, this has the attractive
property of breaking down into ambiguity in the right circumstances
- such as for large interconnected messes, floppy formless ones,
liquids, and so on.

Figure 9: Poking can reveal a difference in the shape of two
objects without any prior knowledge of their appearance.

Figure 10: Mirror neurons and causality: from the observer’s point
of view (A), understanding B’s action means mapping it onto the
observer’s own motor repertoire. If the causal chain leading to the
goal is already in place (lower branch of the graph) then the
acquisition of a mirror neuron for this particular action/object is
a matter of building and linking the upper part of the chain to the
lower one. There are various opportunities to reinforce this link
either at the object level, at the goal level or both.


6.  Developing  mirror  neurons? 

Poking moves us one step outwards on a causal chain away from the
robot and into the world, and gives a simple experimental procedure
for segmenting objects. There are many possible elaborations of
this method (some are mentioned in the conclusions), all of which
lead to a vision system that is tuned to acquiring data about an
object by seeing it manipulated by the robot. An interesting
question then is whether the system could extract useful
information from seeing an object manipulated by someone else. In
the case of poking, the robot needs to be able to estimate the
moment of contact and to track the arm sufficiently well to
distinguish it from the object being poked. We are interested in
how the robot might learn to do this. One approach is to chain
outwards from an object the robot has poked. If someone else moves
the object, we can reverse the logic used in poking - where the
motion of the manipulator identified the object - and identify a
foreign manipulator through its effect on the object. Poking is an
ideal testbed for future work on this, since it is much simpler
than full-blown object manipulation and would only require a very
simple model of the foreign manipulator to work.

There is considerable precedent in the literature for a strong
connection between viewing object manipulation performed by either
oneself or another (Wohlsclager and Bekkering, 2002). As we already
mentioned, F5 contains a class of neurons called canonical neurons
that have a very specific response when an object is being either
manipulated or fixated. Grossly simplifying, we might think of
canonical neurons as an association table of grasp/manipulation
(action) types with object (vision) types. Another class of neurons
called “mirror neurons” can then be thought of as a second-level
association map which links together the observation of a
manipulative action performed by somebody else with the neural
representation of one’s own action.

Figure 10 shows this causal chain in action. There is a series of
interesting behaviors that can be realized based on mirror neurons.
Mimicry is an obvious application, since it requires just this type
of mapping between other and self in terms of motor actions.
Another important application is the prediction of future behavior
from current actions, or even inverting the causal relation to find
the action that will most likely lead to the desired consequence.


Figure 11: The ultimate goal of this work is for our robot to
follow chains of causation outwards from its own simple body into
the complex world.


7.  Discussion  and  Conclusions 

In this paper, we showed how causality can be probed at different
levels by the robot. Initially the environment was the body of the
robot itself, then later a carefully circumscribed interaction with
the outside world. This is reminiscent of Piaget’s distinction
between primary and secondary circular reactions (Ginsburg and
Opper, 1978). Objects are central to interacting with the outside
world. We raised the issue of how an agent can autonomously acquire
a working definition of objects.

In computer vision there is much to be gained by bringing a
manipulator into the equation. Many variants and extensions to the
experimental “poking” strategy explored here are possible. For
example, a robot might try to move an arm around behind the object.
As the arm moves behind the object, it reveals its occluding
boundary. This is a precursor to visually extracting shape
information while actually manipulating an object, which is more
complex since the object is also being moved and partially occluded
by the manipulator. Another possible strategy that could be adopted
as a last resort for a confusing object might be to simply hit it
firmly, in the hopes of moving it some distance and potentially
overcoming local, accidental visual ambiguity. Obviously this
strategy cannot always be used! But there is plenty of room to be
creative here.